Partitioning Parallel Documents Using Binary Segmentation

Authors

  • Jia Xu
  • Richard Zens
  • Hermann Ney
Abstract

In statistical machine translation, large numbers of parallel sentences are required to train the model parameters. However, many of the bilingual language resources available on the web are aligned only at the document level. To exploit this data, we have to extract the bilingual sentences from these documents. The common method is to break the documents into segments using predefined anchor words and then align these segments. This approach is not error-free, and incorrect alignments may decrease the translation quality. We present an alternative approach to extract the parallel sentences by partitioning a bilingual document into two pairs. This process is performed recursively until all the sub-pairs are short enough. In experiments on the Chinese-English FBIS data, our method was capable of producing translation results comparable to those of a state-of-the-art sentence aligner. Using a combination of the two approaches leads to better translation performance.
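
The recursive partitioning described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how a split point is chosen, so `proportional_score` is a hypothetical placeholder that simply prefers cuts at the same relative position on both sides, and only monotone cuts (left with left, right with right) are considered. The names `binary_segment`, `proportional_score`, and `max_len` are assumptions made for illustration.

```python
from typing import Callable, List, Tuple

Tokens = List[str]
Pair = Tuple[Tokens, Tokens]
# Hypothetical scorer signature: (src, tgt, i, j) -> float, higher is better.
SplitScore = Callable[[Tokens, Tokens, int, int], float]


def proportional_score(src: Tokens, tgt: Tokens, i: int, j: int) -> float:
    """Placeholder score (an assumption, not from the paper):
    prefer cuts at the same relative position in both texts."""
    return -abs(i / len(src) - j / len(tgt))


def binary_segment(
    src: Tokens,
    tgt: Tokens,
    max_len: int,
    score: SplitScore = proportional_score,
) -> List[Pair]:
    """Recursively split a (source, target) pair into two sub-pairs
    until both sides of every sub-pair have at most max_len tokens."""
    short_enough = len(src) <= max_len and len(tgt) <= max_len
    # Stop when the pair is short enough or one side cannot be cut further.
    if short_enough or len(src) < 2 or len(tgt) < 2:
        return [(src, tgt)]
    # Try every monotone cut (i, j) and keep the best-scoring one.
    candidates = [(i, j) for i in range(1, len(src)) for j in range(1, len(tgt))]
    i, j = max(candidates, key=lambda c: score(src, tgt, c[0], c[1]))
    return (
        binary_segment(src[:i], tgt[:j], max_len, score)
        + binary_segment(src[i:], tgt[j:], max_len, score)
    )


# Toy "document pair" with made-up tokens.
src = "s1 s2 s3 s4 s5 s6 s7 s8".split()
tgt = "t1 t2 t3 t4 t5 t6 t7 t8".split()
segments = binary_segment(src, tgt, max_len=2)
```

Concatenating the sub-pairs in order reproduces the original token sequences, so no text is lost by the partitioning. In practice the placeholder scorer would be replaced by one that measures how well the two halves translate each other.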


Similar resources

Detection of changes in variance using binary segmentation and optimal partitioning

This work explores the performance of binary segmentation and optimal partitioning in the context of detecting changes in variance for time series. Both binary segmentation and optimal partitioning are based on cost functions that penalise a large number of changepoints in order to avoid overfitting. Analysis is performed on simulated time series; first on Normal data with constant but unknown...


Multi-organ Segmentation Using Vantage Point Forests and Binary Context Features

Dense segmentation of large medical image volumes using a labelled training dataset requires strong classifiers. Ensembles of random decision trees have been shown to achieve good segmentation accuracies with very fast computation times. However, smaller anatomical structures such as muscles or organs with high shape variability present a challenge to them, especially when relying on axis-paral...


Binary Space Partitioning and Sparse Geometric Wavelets Representation for Image Compression

For low bit-rate compression applications, segmentation-based coding methods provide, in general, high compression ratios when compared with traditional (e.g., transform and subband) coding approaches. In this paper, we present a segmentation based image coding method that divides the desired image using binary space partitioning (BSP). Geometric wavelet is a recent development in the field of ...


Time Complexity Analysis of Binary Space Partitioning Scheme for Image Compression

Segmentation-based image coding methods provide high compression ratios when compared with traditional image coding approaches like transform and subband coding for low bit-rate compression applications. In this paper, a segmentation-based image coding method, namely the Binary Space Partition scheme, that divides the desired image using a recursive procedure for coding is presented. The...


High Performance Implementation of Fuzzy C-Means and Watershed Algorithms for MRI Segmentation

Image segmentation is one of the most common steps in digital image processing. There are many image segmentation algorithms (e.g., thresholding, edge detection, and region growing) employed for classifying a digital image into different segments. In this connection, finding a suitable algorithm for medical image segmentation is a challenging task, mainly due to the noise, low contrast, and steep...




Publication date: 2006